perlunicode: Add discussion about malformations #23553

khwilliamson · 2025-08-09T17:38:06Z

And especially the REPLACEMENT CHARACTER

jkeenan · 2025-08-10T11:42:28Z

Until reading this, I had never heard of the "Unicode REPLACEMENT CHARACTER (U+FFFD)", even though I'd seen it thousands of times. So I'm not in a position to confirm the specifics of this p.r. However, once we review that, this can go in.

bulk88 · 2025-08-10T16:17:05Z

pod/perlunicode.pod

+change to counter.  (Although no new ones have become known recently
+that the Unicode Standard wasn't prepared for.)  And CPAN modules can
+easily lag behind the interpreter itself.
+


Everything above is perfect, don't change it.

bulk88 · 2025-08-10T16:34:51Z

pod/perlunicode.pod

+
+    binmode $fh, ":encoding(UTF-16)";
+
+on these.  See L<Encode> for more information.


I strongly disagree with this. Having the UTF8 flag on is very rarely seen inside the interp. I've seen 4 SW projects/code bases in my whole life that had all variables in a non-Latin script. There is absolutely nothing wrong with doing that, although someone's career outlook is likely to be a box on the sidewalk if they leave their country having memorized the Python/C/Perl langs only in their native non-Latin human language.

In addition, severe performance penalties come with turning on the UTF8 flag. The SVt_PV-SVt_PVLV structs are incapable of storing a SVCURU() length. Not to mention that index() and the FBM/Boyer–Moore logic in the interp instantly drop dead/drop out (correct me if I'm wrong).

For GUI strings, yes, you 100% need UTF8 turned on to never slice open an emoji or skin color modifier or country flag emoji.
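To make the "slicing open" concern concrete, here is a minimal sketch. It uses a combining accent rather than an emoji so the result does not depend on the Unicode version of the installed Perl; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "e" followed by COMBINING ACUTE ACCENT: two code points, one visible glyph.
my $text = "e\x{301}";

my $n_codepoints = length $text;        # length() counts code points here
my @clusters     = $text =~ /\X/g;      # \X matches one extended grapheme cluster
my $n_clusters   = scalar @clusters;

# The same text as UTF-8 octets; length() now counts bytes.
my $octets   = encode("UTF-8", $text);
my $n_octets = length $octets;

# Cutting the octets at an arbitrary byte offset splits the combining mark.
my $slice    = substr($octets, 0, 2);
my $copy     = $slice;
my $slice_ok = utf8::decode($copy) ? 1 : 0;   # false: the slice is not valid UTF-8

print "code points=$n_codepoints clusters=$n_clusters octets=$n_octets slice_ok=$slice_ok\n";
```

Slicing on grapheme-cluster boundaries (the elements of `@clusters`) avoids the problem; slicing on bytes or even on code points does not.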

But UTF8 flag on passwords, network addresses, things that will become unprintable fixed length binary byte arrays, like C types short int and long and size_t, \w+ things that become hash keys or JSON field names, absolutely not.

Western Europe has had a non-violent war going on for the last couple of decades over the dots, apostrophes, and tails added to the 26 letters of US English. 2 code points or 1 code point? Denormalization, overlongs.

Not to mention yearly Unicode Commission updates that can rewrite fields in your SQL DB/name upper casing/lower casing/alpha sorted CSV files better than any malware software from .

Unless it's a GUI, or variable-length human-written freeform text that is meaningless to transistors, UTF8 logic needs to stay out of that library. It's a byte array: permutations 0-255, 0-65K, 0-4 billion. That string has no other meaning to SW.

"UTF-8 is how Unicode strings are stored internally by the interpreter." is really the only sentence I disagree with.

"And on most platforms, most inputs, such as files, will also be in UTF-8." 100% correct. Latin-1 and weird ROM chips from the 80s from other continents for your EGA video card are as extinct as a PDP11. Latin-1 does not circulate anywhere in production code anymore.

I'm trying to understand your point here. The text does not say anything about the UTF-8 flag. It is just giving the only way that a file encoded as UTF-8 currently can be read and be checked by the system for well-formedness.
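As a rough illustration of reading UTF-8 input through an encoding layer, the sketch below uses an in-memory filehandle in place of a real file. The sample data and variable names are assumptions made for the example, and only the well-formed case is shown.

```perl
use strict;
use warnings;

# Raw octets as they would sit on disk: "café" plus newline, encoded as UTF-8.
my $octets = "caf\xC3\xA9\n";

# The :encoding(UTF-8) layer decodes the octets into characters as they are read.
open(my $fh, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $line = <$fh>;
close $fh;

chomp $line;
my $n_chars = length $line;   # counts characters, not bytes

print "read $n_chars characters\n";
```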

Fast Boyer-Moore continues to be used for UTF-8. I don't know that the UTF-8 flag is automatically set by Encode if not necessary. I believe it is set only if necessary.

"I don't know that the UTF-8 flag is automatically set by Encode if not necessary. I believe it is set only if necessary."

I'm not sure it always sets the flag, but it does set it when unnecessary:

    $ perl -MEncode=decode,FB_CROAK -MDevel::Peek -e 'my $x = decode("UTF-8", (my $src = "abc"), FB_CROAK); Dump($x)'
    SV = PV(0x5588f3fb85e0) at 0x5588f3e420d0
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x5588f403bf50 "abc"\0 [UTF8 "abc"]
      CUR = 3
      LEN = 10

More important to user-facing documentation, Perl (and by extension Encode) makes no guarantees whether the flag will be set when it is not necessary.
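A small sketch of why code should not depend on the flag: two strings holding the same characters compare equal and act as the same hash key whether or not the flag is set. The variable names are illustrative.

```perl
use strict;
use warnings;

my $plain    = "abc";
my $upgraded = "abc";
utf8::upgrade($upgraded);    # same characters, now stored with the UTF8 flag on

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# String semantics do not depend on the internal representation.
my $same_string = ($plain eq $upgraded) ? 1 : 0;

my %seen     = ($plain => 1);
my $same_key = exists $seen{$upgraded} ? 1 : 0;

print "flags: $flag_plain/$flag_upgraded equal=$same_string same_key=$same_key\n";
```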

bulk88 · 2025-08-10T16:40:59Z

pod/perlunicode.pod

+can be returned when a malformation is encountered.  This could
+conceivably lead to trojan strings where the second, trojan, part is
+hidden from code that is expecting a NUL-terminated string.
+


Maybe this "improvement" is self-destructive and a bad idea to put in this wiki article, but someone could link to the 5-10 CVEs and RT tickets in the last 15 years about data smuggling with a null char in the middle of a byte array aka string.
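For readers unfamiliar with the embedded-NUL problem, the following sketch shows the mismatch between what Perl sees and what a consumer expecting a NUL-terminated string would see. The sample input is made up, and `unpack "Z*"` merely stands in for a C-style consumer.

```perl
use strict;
use warnings;

# A Perl string may contain NUL; the part after it is invisible to C-style code.
my $input = "alice\0admin";

my $perl_len = length $input;             # Perl counts every character, NUL included
my $c_view   = unpack("Z*", $input);      # what a NUL-terminated reader would see
my $has_nul  = index($input, "\0") >= 0 ? 1 : 0;

print "perl length=$perl_len, C-style view='$c_view', has NUL=$has_nul\n";
```

Rejecting input for which `$has_nul` is true before it reaches such a consumer is one simple guard.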

Yes, 90% of current downloaded/indexed CPAN code is using encoding-unaware getter/setter Perl API methods. For 20%-45% of tarballs, a UTF8/L1 string, along with a custom written/malicious/insane/non-production/illogical PP script, will cause either some kind of trivial waste-of-time-to-fix bug or a severe money-losing commercial bug: SEGVs, timeouts, 504 gateway timeouts, CPU infinite loops. Remote code execution/priv escalation is probably paranoia, but taking down a bunch of rack servers until someone attaches perl5db.pl, yeah, that's $ loss.

I'm trying to impress upon the reader that not upgrading to 5.44 when it comes out has a risk associated with it. Now that I'm looking again at what got into 5.42, that's probably good enough. So I'm open to what the consensus here becomes.

bulk88 · 2025-08-10T16:45:15Z

pod/perlunicode.pod

+Note that finding a REPLACEMENT CHARACTER in your string doesn't
+necessarily mean there is an attack.  It is a perfectly legal input
+character, for whatever reason.
+


I disagree. WinPerl's generic C coding policy is, and MS's help docs say, that finding a REPLACEMENT CHARACTER is an I/O error: the SATA/SCSI/IDE cable was yanked, no more information is available. Imagine feeding all of the BMP, or 100s of invalid UTF-8 surrogates, into a state machine/de-dupe logic/HV* hash, and every last string pops out of the front end or back end of the API being memcmp(username1, username2, len) == 0 identical. Yet they were all unique, different customers/IP addresses/zip codes/email addresses 1 millisecond ago.

What I meant is that TUS does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal.
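Both points can be seen in a short sketch, assuming Encode's documented default of substituting U+FFFD when no CHECK argument is given. The sample byte strings are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Two different malformed inputs: each contains a lone continuation byte.
my @raw = ("al\x80ce", "al\xBFce");

# Lenient decoding substitutes U+FFFD, so distinct inputs become identical.
my @lenient   = map { decode("UTF-8", $_) } @raw;
my $collapsed = ($lenient[0] eq $lenient[1]) ? 1 : 0;

# Strict decoding refuses the malformed input instead of substituting.
my @rejected = grep {
    my $copy = $_;    # FB_CROAK may modify its source, so work on a copy
    !eval { decode("UTF-8", $copy, FB_CROAK); 1 };
} @raw;
my $n_rejected = scalar @rejected;

# A well-formed U+FFFD in the input is accepted even under strict decoding.
my $legit_octets = "\xEF\xBF\xBD";
my $legit        = decode("UTF-8", $legit_octets, FB_CROAK);
my $legit_ok     = ($legit eq "\x{FFFD}") ? 1 : 0;

print "collapsed=$collapsed rejected=$n_rejected legit_ok=$legit_ok\n";
```

Code that uses decoded strings as identifiers (usernames, hash keys) is the case where the lenient default is risky; the strict form avoids the collapse.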

bulk88 · 2025-08-10T17:27:01Z

pod/perlunicode.pod

+Unicode equivalents for most, but not all characters in it.  You just
+use the REPLACEMENT CHARACTER for the missing ones.  As long as most of
+the text is translatable, the results could be intelligible to a human
+reader.



ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ

I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is that if you have valid text in some encoding that you are translating to Unicode, then when you encounter a character that doesn't have a Unicode equivalent, TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)

My opinion is that U+FFFD and low 7-bit "?" are not readable with human eyes. It's not an overcompressed JPEG image, or what I did with KOI8 bitwise math logic up above. It's a bad disk sector: you will never know what was behind the square, and no amount of $ to a data recovery firm will get that character back, or give you a good enough guess at what it used to be.

If it's AI-generated auto captions, it's "???" or "...", or nowadays AI algos just print "*music*" for 8 minutes instead of low 7-bit "?" on the screen. I just don't want anyone to think a U+FFFD or a '?' in a CSV file is ever acceptable coding practice, and to walk away and move on with daily life after seeing it for half a second in a system tracing log/SQL table/dev tools console.


bulk88 · 2025-08-10T18:08:15Z

pod/perlunicode.pod

-the new world of Unicode, upgrading when necessary.
-If your legacy code does not explicitly use Unicode, no automatic
-switch-over to Unicode should happen.
-


This paragraph has got to go. It's humor in 2025 and good for a chuckle, but definitely not for a college student to read. Latin-1 is a synonym for raw binary/hex dumps. L1 isn't a transport protocol for data exchange anymore. Latin-1 doesn't exist over a copper wire longer than a USB cable.

Most IDEs/OSes should probably permanently switch from Latin-1 fonts drawing "known" Latin-1 text to tiny hexadecimal emoji font characters. I've used those tiny hexadecimal font files in the recent past. IDN and KOI8-R were very good solutions to the Unicode homoglyph security attacks.

When in doubt, start drawing \xff ASCII escape codes to a non-IT medical worker. At least then you can record the patient's obviously wrong, but ZERO DATA loss, name with a ball point pen. That's the USA Social Security Admin's official policy BTW, and Chinese has had a deterministic 2-way reversible Latinizing protocol for the last couple of decades, and those are the WWW IDN spec's rules.

That's also WinPerl's official legal policy, located at
https://github.com/Perl/perl5/blob/blead/win32/win32.c#L5148
and
https://github.com/Perl/perl5/blob/blead/win32/win32.c#L2284

You might wanna expand this article with basic data scrubbing/sanitization algorithms, like locking all JSON fields to exactly 1 govt-recognized script, with no mix and match between Chinese and English in a JSON string. That's a solution I've heard circulate in the Perl community a couple of times.

One Perl company had to pay language consultants to sanitize the official Unicode commission database, since those character properties are best effort, and not actually "secure" against real-world written text in a newspaper, or a sign on the wall in that country. Letters removed by a Dept of Ed/Ministry of Education in the 1930s-1960s should never enter a SQL DB field in the 2020s.

Something is wrong when the last birth certificate or govt ID ever issued with that letter was 112 years ago and that 112 year old person is registering a new account with you.
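The single-script restriction suggested above can be sketched with the script-run construct documented in L<perlre/Script Runs>. This assumes Perl 5.30 or later (the construct was experimental in 5.28), and `single_script` is a hypothetical helper name, not an existing API.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string is word characters
# drawn from a single script.
sub single_script {
    my ($s) = @_;
    return $s =~ /\A(*script_run:\w+)\z/ ? 1 : 0;
}

my $latin    = "paypal";
my $mixed    = "p\x{430}ypal";                           # U+0430 is CYRILLIC SMALL LETTER A
my $cyrillic = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}";

my $latin_ok    = single_script($latin);
my $mixed_ok    = single_script($mixed);
my $cyrillic_ok = single_script($cyrillic);

print "latin=$latin_ok mixed=$mixed_ok cyrillic=$cyrillic_ok\n";
```

Whether to apply such a check is an application-level policy decision; it rejects the look-alike mix while still accepting text written wholly in a non-Latin script.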

"Something is wrong when the last birth certificate or govt ID ever issued with that letter was 112 years ago and that 112 year old person is registering a new account with you."

That's a problem for the application developer, not the interpreter. Not the Unicode standard. Not the JSON library.

If I want to write about this 112 year old person in comments, or in my code, or on a web page, or even with pen and paper it kinda feels like my programming language shouldn't give me a <?>. It should give me the text I or my user entered.

This bunny existing is not a crime:

ᕬ ᕬ ( ᴗ͈ᆺᴗ͈ ) /︎︎ ︎づ =͟͟͞͞ ♡

tonycoz · 2025-08-11T00:40:33Z

pod/perlunicode.pod

+domain names need to be careful to not give out ones that spoof other ones,
+(examples in L<perlre/Script Runs>).


The rest of a sentence after a comma being parenthetical seems strange to me.

Agreed; I'll fix before pushing.

tonycoz · 2025-08-11T00:50:39Z

pod/perlunicode.pod

+interpreter when working with Unicode.  Not until Perl v5.44 is it fully
+hardened against known attack vectors.  And who knows what new ideas
+clever attackers may come up with in the future, that we will have to


Will any of these be added to ppport.h?

I'm working on that now.

perlunicode: Add discussion about malformations

0a176b6

khwilliamson mentioned this pull request Aug 9, 2025

sv_vcatpvfn_flags: Use utf8_to_uv #23083

Merged

bram-perl approved these changes Aug 9, 2025

